SmithServe will perform an extensive examination of ATP tennis data from 2001 to 2023. The main goal is to give entertainment firms and brands seeking sponsorships insight into professional tennis players' performance. The analysis will deliver data-driven reports and visualizations with practical insights that enable wise decision-making in the fast-paced world of professional tennis.
SmithServe's ATP sports data analysis serves the needs of prominent sports entertainment broadcasters—ESPN in the USA and BeIN Sports in Europe.
Telecasting Enhancement and Anchor Discussions:
ESPN can leverage the analysis for enhanced telecasting by providing detailed insights into close calls and player statistics before each game. The data will provide ESPN anchors with comprehensive player information for in-depth discussions during their coverage.
Player Stats and Betting Insights:
BeIN Sports is a European sports channel whose viewers want detailed player statistics, which they can use to inform betting. Player statistics can be displayed before or during each match. Betting is very popular in Europe, and the insights SmithServe provides will benefit viewers and improve their experience with BeIN Sports relative to other sports broadcasters in Europe.
SmithServe's ATP sports data analysis also serves the needs of DraftKings Sportsbook, a leading betting company.
Informed Betting Decisions and Revenue Optimization:
The analysis presents information on player profiles and performance across matches, giving DraftKings Sportsbook clients a player's probability of winning a match. This helps bettors make more informed, precise decisions. Better bets keep clients returning to place more wagers through DraftKings Sportsbook, which in turn supports the company's revenue and popularity.
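One simple way to estimate such a win probability from historical results is to count past wins between the two players. The sketch below is a minimal illustration using toy data and hypothetical player names; the column names match the renamed dataset introduced later in this report.

```python
import pandas as pd

# Toy match history; in the real analysis these rows come from the ATP dataset
matches = pd.DataFrame({
    'Name of Winner': ['Player A', 'Player B', 'Player A', 'Player A'],
    'Name of Losing Player': ['Player B', 'Player A', 'Player C', 'Player B'],
})

def head_to_head_probability(df, player, opponent):
    """Share of past meetings between the two players that `player` won."""
    wins = ((df['Name of Winner'] == player) &
            (df['Name of Losing Player'] == opponent)).sum()
    losses = ((df['Name of Winner'] == opponent) &
              (df['Name of Losing Player'] == player)).sum()
    total = wins + losses
    return wins / total if total else None

print(head_to_head_probability(matches, 'Player A', 'Player B'))  # 2 of 3 meetings
```

A head-to-head count is only a rough baseline; the machine-learning model described later incorporates richer features such as surface type and serve statistics.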
Nike, Under Armour, Lacoste, and Sergio Tacchini will want to make sure that they associate their brands with the right player. The goal is to pick a player who will expand and elevate the brand, so they will look for a player who demonstrates excellence, precision, and great performance.
SmithServe gathered ATP (Association of Tennis Professionals) data through Kaggle, spanning 1968 through 2023 (https://www.kaggle.com/datasets/warcoder/atp-tennis-rankings-results-and-stats1968-2023). However, SmithServe focused on data from 2001 to 2023 in order to produce more relevant and current information.
The data contains detailed information about every aspect of each match, which explains the 49 columns and the tens of thousands of observations spanning 1968 to 2023. The columns present in the dataset are:
| Columns | Description |
|---|---|
| tourney_id | Tennis tournament ID |
| tourney_name | Name of the tournament |
| surface | Type of court ground on which the tournament was played |
| Draw_size | Number of players playing in the tournament |
| tourney_level | Level of the tournament being played: whether it is a Grand Slam, a Masters tournament, an ATP 500, and so on |
| tourney_date | Date the tournament was played on |
| match_num | Match number within the tournament |
| winner_id | ID of the winning player |
| Winner_seed | Seeding of the winning player |
| Winner_entry | Special tournament entry condition for the winner |
| Winner_name | Name of the winning player |
| Winner_hand | The hand of the winning player (R/L/U) |
| Winner_ht | Height of the winning player |
| Winner_ioc | Country of origin of the winning player |
| Winner_age | Indicates the age of the winning player |
| Loser_id | ID of the losing player |
| Loser_seed | Seeding of the losing player |
| Loser_entry | Special tournament entry condition for the loser |
| Loser_name | Name of the losing player |
| Loser_hand | The hand of the losing player (R/L/U) |
| Loser_ht | Height of the losing player |
| Loser_ioc | Country of origin of the losing player |
| Loser_age | Indicates the age of the losing player |
| Score | Final score of the match |
| Best_of | Number of sets played in a game |
| Round | Round that was played |
| minutes | Number of minutes a match was played for |
| W_ace | Number of aces hit by the winner in a match |
| w_df | Number of double faults by the winner while serving |
| w_svpt | Number of service points played by the winning player |
| w_1stIn | Number of first serves that landed in for the winner |
| w_1stWon | How many points the winner won when their first serve landed in |
| w_2ndWon | How many points the winner won when their second serve landed in |
| w_SvGms | Total number of service games played by the winner |
| w_bpSaved | Number of times the winner was able to win the point when their opponent had a chance to break their serve |
| w_bpFaced | Total number of times the winner faced a breakpoint |
| l_ace | How many aces the loser hit |
| l_df | Number of double faults by the losing player while serving |
| l_svpt | Number of service points played by the losing player |
| l_1stIn | Number of first serves that landed in for the losing player |
| l_1stWon | How many points the losing player won when their first serve landed in |
| l_2ndWon | How many points the losing player won when their second serve landed in |
| l_SvGms | Total number of service games played by the losing player |
| l_bpSaved | Total number of times the losing player saved a breakpoint |
| l_bpFaced | Total number of times the losing player faced a breakpoint |
| Winner_rank | The rank of the winning player amongst the active players |
| Winner_rank_points | Total number of ranking points that the winning player has accumulated at the time of a particular match |
| Loser_rank | Rank of the losing player amongst the active players |
| Loser_rank_points | Total number of ranking points that the losing player has accumulated at the time of a particular match |
However, the original data requires filtering and cleaning for better understanding and further analysis. Since it is a huge dataset, many columns are not relevant to our analysis and should be dropped to make the analysis more focused. Also, the column names use tennis-specific terminology and need to be translated into human-readable terms to support better decisions during the analysis. All these tasks are performed during the data cleaning process.
Hence, after dropping the columns mentioned above, we decided to rename the remaining columns, making the terms easier to understand before conducting the analysis.
Sponsorship companies such as Lacoste and Sergio Tacchini will want to sponsor a player who carries meaning for the brand. Knowing which tournaments a player wins most often allows a brand from that region to decide which player to associate with.
Sports entertainment companies like our clients ESPN and BeIN Sports need the most reliable information about the top players when reporting on and debating tennis matches. Anchors will be able to tell a story about how a certain player wins more often, or how evenly matched two players have historically been. This differentiates the broadcasters from other sports channels and helps bring in viewers, because the anchors can provide richer player information grounded in the historical data that the analysis supplies.
The model uses the total number of aces, the total break points saved by the winner and loser, and the type of surface. After executing the model, we find 99% accuracy in predicting the winner. Our work on data analysis answers important questions regarding players' performance on all ground types and uses players' statistics to predict winners in future matches. We implemented a machine learning model to go above and beyond: it predicts a winner between two user-entered player names on a particular ground type. This not only allowed us to answer the analysis questions posed in the project but also to make predictions about future data. Additionally, we built comprehensive documentation, including clear data analysis steps with comments, markdown cells, and discussion of inferences based on our findings. Instead of simply presenting the results in a table or chart, we interpreted the results in the context of the research question, discussed their implications, and suggested areas for future research. We believe our analysis adds value to the project and demonstrates a high level of competence in data analysis.
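As a rough illustration of how such a classifier could be wired up (the model code itself is not shown at this point in the report), the sketch below trains a scikit-learn random forest on synthetic stand-ins for the features named above. scikit-learn is an assumed dependency, the feature names are hypothetical, and the 99% accuracy figure comes from the full ATP dataset, not from this toy data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the real features: aces, break points saved, surface
X = pd.DataFrame({
    'total_aces_p1': rng.integers(0, 25, n),
    'total_aces_p2': rng.integers(0, 25, n),
    'bp_saved_p1': rng.integers(0, 15, n),
    'bp_saved_p2': rng.integers(0, 15, n),
    'surface': rng.choice(['Hard', 'Clay', 'Grass'], n),
})
# Synthetic label: player 1 wins, loosely tied to the ace difference
y = (X['total_aces_p1'] - X['total_aces_p2'] + rng.normal(0, 5, n) > 0).astype(int)

# One-hot encode the categorical surface feature
X = pd.get_dummies(X, columns=['surface'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f'Test accuracy: {acc:.2f}')
```

With the real dataset, the same pipeline would take the historical match rows for the two user-entered players on the chosen surface as its feature source.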
We have multiple datasets corresponding to the different years of the ATP (Association of Tennis Professionals) stats, ranging from the years 2001 - 2023 (about 2 decades). Our objective is to bring this data together in a simple and unified view by integrating and consequently cleaning it.
By integrating these datasets, we achieve several benefits:
A single CSV file that contains all the data and is easier to access. The os Python library can automate this process. The integration process includes the following steps:
The os Python library is generally used for miscellaneous operating-system functions (documentation for os: https://docs.python.org/3/library/os.html). From this library, we use these functions:
os.listdir(directory): lists all the files inside a specific directory.
os.path.join(directory, filename): concatenates the directory path with a filename.
.endswith(suffix): a string method in Python (not part of os) that checks whether a string ends with a specific suffix. We use it to check for .csv files in a given directory.
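The three functions above can be exercised on a small scratch directory. A minimal, self-contained sketch (the filenames are illustrative):

```python
import os
import tempfile

# Create a scratch directory holding a mix of file types
directory = tempfile.mkdtemp()
for name in ('atp_matches_2001.csv', 'atp_matches_2002.csv', 'readme.txt'):
    open(os.path.join(directory, name), 'w').close()

# Keep only the CSV files, joining each name back to its full path
csv_paths = [os.path.join(directory, f)
             for f in sorted(os.listdir(directory))
             if f.endswith('.csv')]
print([os.path.basename(p) for p in csv_paths])
# ['atp_matches_2001.csv', 'atp_matches_2002.csv']
```

Sorting the directory listing makes the result deterministic, since os.listdir returns entries in arbitrary order.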
The final output will be a comprehensive ATP stats dataset that covers matches, players' statistics, rankings, and other relevant information from 2001 to 2023. This dataset will serve as a common dataframe for future analysis by researchers, sports analysts, commentators, and tennis enthusiasts.
(documentation for pandas: https://pandas.pydata.org/docs/user_guide/index.html)
This is a high-level data analysis and manipulation library used in Python. It provides a wide array of functions that complement the numpy library.
(documentation for numpy: https://numpy.org/doc/1.26/user/index.html#user)
NumPy is a fundamental package for scientific computing in Python; many of its functions support the numerical parts of this analysis.
# Import statements
import os
import numpy as np
import pandas as pd
from numpy import nan as NA
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import dash
from dash import html, dcc, Input, Output
# Loading the dataset directory:
atp_directory = r'D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\tennis_atp_data'
combined_atp_filename = r'D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\combined_atp_data.csv'
# Merging the datasets into one:
dataframes = []
# Iterating over each CSV file present in the ATP directory
for filename in os.listdir(atp_directory):
    if filename.endswith('.csv'):
        # Construct the full file path
        file_path = os.path.join(atp_directory, filename)
        # Extract the year from the filename (assuming it's in the filename)
        year = int(filename.split('.')[0][-4:])
        # Read the CSV file
        df = pd.read_csv(file_path)
        # Append a year column
        df['year'] = year
        # Append the DataFrame to the list
        dataframes.append(df)

# Concatenate all the DataFrames in the list
combined_df = pd.concat(dataframes)
# Sort the DataFrame based on the year column
combined_df.sort_values('year', inplace=True)
# Set the year as the index
combined_df.set_index('year', inplace=True)
# Save the combined DataFrame to a new CSV file
combined_df.to_csv(combined_atp_filename, index=True)
print(f"All CSV files have been combined into {combined_atp_filename}")
All CSV files have been combined into D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\combined_atp_data.csv
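The year extraction above assumes each filename ends with a four-digit year just before the extension. A slightly more defensive sketch (an alternative, not the report's code) uses a regular expression, so an unexpected filename raises a clear error instead of producing a bogus year:

```python
import re

def year_from_filename(filename):
    """Extract a four-digit year (1900-2099) from a name like 'atp_matches_2001.csv'."""
    match = re.search(r'(19|20)\d{2}', filename)
    if match is None:
        raise ValueError(f'No year found in filename: {filename}')
    return int(match.group())

print(year_from_filename('atp_matches_2023.csv'))  # 2023
```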
# Loading the entire tennis ATP data:
tennis_data = pd.read_csv(r'D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\combined_atp_data.csv')
tennis_data
| year | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | ... | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 2001-301 | Auckland | Hard | 32 | A | 20010108 | 1 | 102905 | NaN | ... | 31.0 | 19.0 | 15.0 | 10.0 | 4.0 | 8.0 | 52.0 | 755.0 | 14.0 | 1598.0 |
| 1 | 2001 | 2001-520 | Roland Garros | Clay | 128 | G | 20010528 | 70 | 103990 | NaN | ... | 29.0 | 17.0 | 14.0 | 12.0 | 9.0 | 16.0 | 83.0 | 506.0 | 205.0 | 174.0 |
| 2 | 2001 | 2001-520 | Roland Garros | Clay | 128 | G | 20010528 | 71 | 103694 | NaN | ... | 61.0 | 44.0 | 20.0 | 19.0 | 12.0 | 20.0 | 70.0 | 609.0 | 21.0 | 1295.0 |
| 3 | 2001 | 2001-520 | Roland Garros | Clay | 128 | G | 20010528 | 72 | 102338 | 7.0 | ... | 87.0 | 54.0 | 23.0 | 21.0 | 12.0 | 20.0 | 7.0 | 2605.0 | 112.0 | 371.0 |
| 4 | 2001 | 2001-520 | Roland Garros | Clay | 128 | G | 20010528 | 73 | 103507 | 4.0 | ... | 38.0 | 21.0 | 8.0 | 11.0 | 17.0 | 24.0 | 4.0 | 2830.0 | 220.0 | 161.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67703 | 2023 | 2023-0403 | Miami Masters | Hard | 128 | M | 20230320 | 275 | 111460 | NaN | ... | 49.0 | 35.0 | 15.0 | 11.0 | 7.0 | 9.0 | 79.0 | 705.0 | 55.0 | 865.0 |
| 67704 | 2023 | 2023-0403 | Miami Masters | Hard | 128 | M | 20230320 | 274 | 106421 | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | 4330.0 | 56.0 | 865.0 |
| 67705 | 2023 | 2023-0403 | Miami Masters | Hard | 128 | M | 20230320 | 273 | 202103 | 25.0 | ... | 33.0 | 26.0 | 8.0 | 10.0 | 2.0 | 6.0 | 31.0 | 1320.0 | 6.0 | 3415.0 |
| 67706 | 2023 | 2023-0403 | Miami Masters | Hard | 128 | M | 20230320 | 279 | 206173 | 10.0 | ... | 32.0 | 23.0 | 7.0 | 10.0 | 6.0 | 11.0 | 11.0 | 2925.0 | 27.0 | 1450.0 |
| 67707 | 2023 | 2023-M-DC-2023-WG2-PO-TUN-CYP-01 | Davis Cup WG2 PO: TUN vs CYP | Hard | 4 | D | 20230203 | 5 | 121411 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 364.0 | 131.0 | 990.0 | 11.0 |
67708 rows × 50 columns
Data cleaning is a crucial step in the data analysis process. It involves preparing any raw dataset for future analysis by handling the missing values, duplicates, and making the data make sense. Below, we explain common data cleaning operations using Python's pandas and numpy libraries.
NaN values: There are multiple ways of handling NaN values: we can either drop them using the .dropna() function or fill them with a given value using .fillna(value). Let us do a descriptive analysis to figure out which of these functions would benefit the analysis most. To remove columns in pandas, we use the .drop() method, specifying one or more column names; columns are renamed with the .rename() function; and duplicate rows are detected with the .duplicated() function.
# Check for NaN percentage for every variable:
nan_percentage = tennis_data.isna().mean() * 100
print(nan_percentage)
year                   0.000000
tourney_id             0.000000
tourney_name           0.000000
surface                0.000000
draw_size              0.000000
tourney_level          0.000000
tourney_date           0.000000
match_num              0.000000
winner_id              0.000000
winner_seed           58.210256
winner_entry          87.404738
winner_name            0.000000
winner_hand            0.013292
winner_ht              2.241980
winner_ioc             0.000000
winner_age             0.005908
loser_id               0.000000
loser_seed            76.924440
loser_entry           79.476576
loser_name             0.000000
loser_hand             0.062031
loser_ht               4.538607
loser_ioc              0.000000
loser_age              0.008862
score                  0.000000
best_of                0.000000
round                  0.000000
minutes               11.076978
w_ace                  8.811366
w_df                   8.811366
w_svpt                 8.811366
w_1stIn                8.811366
w_1stWon               8.811366
w_2ndWon               8.811366
w_SvGms                8.809890
w_bpSaved              8.811366
w_bpFaced              8.811366
l_ace                  8.811366
l_df                   8.811366
l_svpt                 8.811366
l_1stIn                8.811366
l_1stWon               8.811366
l_2ndWon               8.811366
l_SvGms                8.809890
l_bpSaved              8.811366
l_bpFaced              8.811366
winner_rank            0.734034
winner_rank_points     0.734034
loser_rank             1.949548
loser_rank_points      1.949548
dtype: float64
When we analyzed the data, we saw several columns that we were interested in dropping. The reasons for this vary. For some of the data, most of the values were invalid, either N/A, null, or NaN, and thus the entire column was worthless for analysis. For others, the column was not relevant to our analysis, and as such also had no value to include. They were removed simply for the benefit of having a leaner data set to work with.
# Dropping unwanted columns
tennis_data.drop(['winner_seed', 'loser_seed', 'winner_rank_points', 'loser_rank_points',
'winner_entry','loser_entry', 'w_df', 'l_df', 'tourney_date', 'score',
'draw_size', 'match_num', 'round', 'best_of', 'w_1stWon', 'w_2ndWon',
'l_1stWon', 'l_2ndWon'], axis = 1, inplace = True)
Here, we clean up the column names to make them more human-readable.
# Changing column names
tennis_data.rename(columns={
'tourney_id': 'Tournament ID',
'tourney_name': 'Tournament Name',
'surface': 'Type of Surface',
'tourney_level': 'Level of Tournament',
'winner_id': 'Winner ID',
'winner_name': 'Name of Winner',
'winner_hand': 'Winning Player\'s Hand',
'winner_ht': 'Winning Player\'s Height (in cm)',
'winner_ioc': 'Winning Player\'s Country',
'winner_age': 'Winner\'s Age',
'loser_id': 'Loser ID',
'loser_name': 'Name of Losing Player',
'loser_hand': 'Losing Player\'s Hand',
'loser_ht': 'Losing Player\'s Height (in cm)',
'loser_ioc': 'Losing Player\'s Country',
'loser_age': 'Loser\'s Age',
'minutes': 'Game Duration (minutes)',
'w_ace': 'Winner Aces',
'w_svpt': 'Winner Service Points',
'w_1stIn': 'Winner First Serves In',
'w_SvGms': 'Winner Service Games Played',
'l_ace': 'Loser Aces',
'l_svpt': 'Loser Service Points',
'l_1stIn': 'Loser First Serves In',
'l_SvGms': 'Loser Service Games Played',
'winner_rank': 'Winning Player\'s Rank',
'loser_rank': 'Losing Player\'s Rank',
'year': 'Year',
'w_bpSaved': 'Winner Break Points Saved',
'w_bpFaced': 'Winner Break Points Faced',
'l_bpFaced': 'Loser Break Points Faced',
'l_bpSaved': 'Loser Break Points Saved'
}, inplace=True)
tennis_data
| Year | Tournament ID | Tournament Name | Type of Surface | Level of Tournament | Winner ID | Name of Winner | Winning Player's Hand | Winning Player's Height (in cm) | Winning Player's Country | ... | Winner Break Points Saved | Winner Break Points Faced | Loser Aces | Loser Service Points | Loser First Serves In | Loser Service Games Played | Loser Break Points Saved | Loser Break Points Faced | Winning Player's Rank | Losing Player's Rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 2001-301 | Auckland | Hard | A | 102905 | Stefan Koubek | L | 175.0 | AUT | ... | 14.0 | 16.0 | 5.0 | 66.0 | 31.0 | 10.0 | 4.0 | 8.0 | 52.0 | 14.0 |
| 1 | 2001 | 2001-520 | Roland Garros | Clay | G | 103990 | Tommy Robredo | R | 180.0 | ESP | ... | 0.0 | 1.0 | 0.0 | 70.0 | 29.0 | 12.0 | 9.0 | 16.0 | 83.0 | 205.0 |
| 2 | 2001 | 2001-520 | Roland Garros | Clay | G | 103694 | Olivier Rochus | R | 168.0 | BEL | ... | 9.0 | 16.0 | 14.0 | 118.0 | 61.0 | 19.0 | 12.0 | 20.0 | 70.0 | 21.0 |
| 3 | 2001 | 2001-520 | Roland Garros | Clay | G | 102338 | Yevgeny Kafelnikov | R | 190.0 | RUS | ... | 13.0 | 21.0 | 6.0 | 150.0 | 87.0 | 21.0 | 12.0 | 20.0 | 7.0 | 112.0 |
| 4 | 2001 | 2001-520 | Roland Garros | Clay | G | 103507 | Juan Carlos Ferrero | R | 183.0 | ESP | ... | 1.0 | 1.0 | 1.0 | 71.0 | 38.0 | 11.0 | 17.0 | 24.0 | 4.0 | 220.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67703 | 2023 | 2023-0403 | Miami Masters | Hard | M | 111460 | Quentin Halys | R | 191.0 | FRA | ... | 6.0 | 6.0 | 9.0 | 82.0 | 49.0 | 11.0 | 7.0 | 9.0 | 79.0 | 55.0 |
| 67704 | 2023 | 2023-0403 | Miami Masters | Hard | M | 106421 | Daniil Medvedev | R | 198.0 | RUS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | 56.0 |
| 67705 | 2023 | 2023-0403 | Miami Masters | Hard | M | 202103 | Francisco Cerundolo | R | 185.0 | ARG | ... | 1.0 | 2.0 | 3.0 | 61.0 | 33.0 | 10.0 | 2.0 | 6.0 | 31.0 | 6.0 |
| 67706 | 2023 | 2023-0403 | Miami Masters | Hard | M | 206173 | Jannik Sinner | R | 188.0 | ITA | ... | 1.0 | 3.0 | 1.0 | 58.0 | 32.0 | 10.0 | 6.0 | 11.0 | 11.0 | 27.0 |
| 67707 | 2023 | 2023-M-DC-2023-WG2-PO-TUN-CYP-01 | Davis Cup WG2 PO: TUN vs CYP | Hard | D | 121411 | Moez Echargui | R | 180.0 | TUN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 364.0 | 990.0 |
67708 rows × 32 columns
After dropping some columns and renaming the columns, we want to make sure there are no duplicate observations present to skew the analysis.
# Check for duplicates:
duplicates = tennis_data.duplicated()
number_of_duplicates = duplicates.sum()
print(f'Number of duplicate rows: {number_of_duplicates}')
Number of duplicate rows: 0
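Note that .duplicated() only flags rows that are identical across every column. If two source files overlapped, near-duplicates with slightly different statistics could still slip through; a hedged extra check (the toy rows and the choice of key columns below are illustrative) compares rows on a subset of identifying columns:

```python
import pandas as pd

# Toy example: the same match recorded twice with one differing statistic
df = pd.DataFrame({
    'Tournament ID': ['2001-301', '2001-301', '2001-520'],
    'Winner ID': [102905, 102905, 103990],
    'Loser ID': [101736, 101736, 104022],
    'Winner Aces': [5, 6, 2],  # differs, so a full-row check misses the repeat
})

print(df.duplicated().sum())  # 0: no row is identical across all columns
print(df.duplicated(subset=['Tournament ID', 'Winner ID', 'Loser ID']).sum())  # 1
```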
There are two main options for handling the remaining missing values: filtering them out with the .dropna() method or imputing them with the .fillna() method. Filtering out a small proportion of data is usually acceptable, but since this dataset consists of more than 60,000 rows, filtering may affect the quality of the data.
# Provide a summary of the descriptive statistics of the entire tennis data
tennis_data.describe()
| Year | Winner ID | Winning Player's Height (in cm) | Winner's Age | Loser ID | Losing Player's Height (in cm) | Loser's Age | Game Duration (minutes) | Winner Aces | Winner Service Points | ... | Winner Break Points Saved | Winner Break Points Faced | Loser Aces | Loser Service Points | Loser First Serves In | Loser Service Games Played | Loser Break Points Saved | Loser Break Points Faced | Winning Player's Rank | Losing Player's Rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 67708.000000 | 67708.000000 | 66190.000000 | 67704.000000 | 67708.000000 | 64635.000000 | 67702.000000 | 60208.000000 | 61742.000000 | 61742.000000 | ... | 61742.000000 | 61742.000000 | 61742.000000 | 61742.000000 | 61742.000000 | 61743.000000 | 61742.000000 | 61742.000000 | 67211.000000 | 66388.000000 |
| mean | 2011.380250 | 108946.000177 | 186.200030 | 26.346919 | 109010.336873 | 185.667796 | 26.449805 | 106.949575 | 6.923504 | 77.945953 | ... | 3.451378 | 5.018010 | 5.113116 | 81.016342 | 48.665382 | 12.308294 | 4.766788 | 8.604516 | 79.346759 | 117.933256 |
| std | 6.541564 | 18394.042835 | 6.827807 | 3.971546 | 18438.182738 | 6.781806 | 4.088679 | 41.292377 | 5.553292 | 29.217210 | ... | 3.070134 | 4.024942 | 4.899089 | 29.215464 | 19.260585 | 4.232881 | 3.265035 | 4.138488 | 138.557775 | 186.546855 |
| min | 2001.000000 | 100644.000000 | 163.000000 | 14.900000 | 100644.000000 | 163.000000 | 14.500000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 2006.000000 | 103656.000000 | 183.000000 | 23.400000 | 103580.000000 | 183.000000 | 23.500000 | 77.000000 | 3.000000 | 56.000000 | ... | 1.000000 | 2.000000 | 2.000000 | 60.000000 | 35.000000 | 9.000000 | 2.000000 | 6.000000 | 18.000000 | 36.000000 |
| 50% | 2011.000000 | 104468.000000 | 185.000000 | 26.100000 | 104468.000000 | 185.000000 | 26.300000 | 99.000000 | 6.000000 | 73.000000 | ... | 3.000000 | 4.000000 | 4.000000 | 76.000000 | 45.000000 | 11.000000 | 4.000000 | 8.000000 | 45.000000 | 68.000000 |
| 75% | 2017.000000 | 105332.000000 | 190.000000 | 29.000000 | 105453.000000 | 190.000000 | 29.300000 | 130.000000 | 9.000000 | 94.000000 | ... | 5.000000 | 7.000000 | 7.000000 | 97.000000 | 59.000000 | 15.000000 | 7.000000 | 11.000000 | 85.000000 | 114.000000 |
| max | 2023.000000 | 211468.000000 | 211.000000 | 42.300000 | 212041.000000 | 211.000000 | 46.000000 | 1146.000000 | 113.000000 | 491.000000 | ... | 24.000000 | 30.000000 | 103.000000 | 489.000000 | 328.000000 | 91.000000 | 27.000000 | 38.000000 | 2101.000000 | 2159.000000 |
8 rows × 22 columns
# checking for all the null values in the dataset
tennis_data.isnull()
| Year | Tournament ID | Tournament Name | Type of Surface | Level of Tournament | Winner ID | Name of Winner | Winning Player's Hand | Winning Player's Height (in cm) | Winning Player's Country | ... | Winner Break Points Saved | Winner Break Points Faced | Loser Aces | Loser Service Points | Loser First Serves In | Loser Service Games Played | Loser Break Points Saved | Loser Break Points Faced | Winning Player's Rank | Losing Player's Rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67703 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 67704 | False | False | False | False | False | False | False | False | False | False | ... | True | True | True | True | True | True | True | True | False | False |
| 67705 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 67706 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 67707 | False | False | False | False | False | False | False | False | False | False | ... | True | True | True | True | True | True | True | True | False | False |
67708 rows × 32 columns
The following Python code uses the matplotlib and seaborn libraries to create a heatmap that visualizes the distribution of NaN (missing) values in the tennis_data DataFrame.
A heatmap is a data visualization technique that can show the distribution of NaN values across variables and observations.
plt.figure(figsize=(10, 12))
sns.heatmap(tennis_data.isna(), cbar=False)
plt.title('Heatmap of NaN Values in the Dataset')
plt.show()
1) Across the rows, NaN values are spread out fairly evenly over time.
2) The columns on the right side contain far more NaN values than those on the left.
3) The Game Duration variable has a large contiguous gap of NaN values.
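That gap can be inspected more precisely by computing the share of missing Game Duration values per year. A minimal sketch on toy data (the column names match the renamed dataset; the years and values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: minutes are fully missing for 2005, present otherwise
toy = pd.DataFrame({
    'Year': [2004, 2004, 2005, 2005, 2006],
    'Game Duration (minutes)': [95.0, 120.0, np.nan, np.nan, 88.0],
})

# Fraction of NaN values in each year: mean of the boolean isna() mask
nan_share_by_year = (toy['Game Duration (minutes)']
                     .isna()
                     .groupby(toy['Year'])
                     .mean())
print(nan_share_by_year)
```

Run on the full dataset, this would pinpoint exactly which years account for the gap visible in the heatmap.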
We use the .dropna() method and compare the descriptive statistics with the original data to check for disparities and the effect on the analysis.
# will drop all the rows with null values, axis = 0
drop_tennis_data_check = tennis_data.dropna()
# will describe the descriptive statistics of the tennis data after filtering the values.
drop_tennis_data_check.describe()
| Year | Winner ID | Winning Player's Height (in cm) | Winner's Age | Loser ID | Losing Player's Height (in cm) | Loser's Age | Game Duration (minutes) | Winner Aces | Winner Service Points | ... | Winner Break Points Saved | Winner Break Points Faced | Loser Aces | Loser Service Points | Loser First Serves In | Loser Service Games Played | Loser Break Points Saved | Loser Break Points Faced | Winning Player's Rank | Losing Player's Rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | ... | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 | 58946.000000 |
| mean | 2011.403725 | 108704.925780 | 186.249347 | 26.449637 | 108421.738456 | 185.673328 | 26.646149 | 107.206630 | 6.908136 | 77.949988 | ... | 3.455502 | 5.024226 | 5.115445 | 80.989957 | 48.639874 | 12.303973 | 4.765650 | 8.595766 | 59.395413 | 89.983867 |
| std | 6.564967 | 17800.676578 | 6.816407 | 3.934795 | 17058.869881 | 6.774238 | 3.994296 | 41.175354 | 5.530789 | 29.140404 | ... | 3.071682 | 4.022039 | 4.895310 | 29.156173 | 19.216700 | 4.222530 | 3.262421 | 4.133036 | 73.487907 | 110.709605 |
| min | 2001.000000 | 100644.000000 | 168.000000 | 15.800000 | 100644.000000 | 163.000000 | 15.400000 | 3.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 2006.000000 | 103607.000000 | 183.000000 | 23.600000 | 103507.000000 | 183.000000 | 23.700000 | 77.000000 | 3.000000 | 57.000000 | ... | 1.000000 | 2.000000 | 2.000000 | 60.000000 | 35.000000 | 9.000000 | 2.000000 | 6.000000 | 17.000000 | 35.000000 |
| 50% | 2011.000000 | 104460.000000 | 185.000000 | 26.200000 | 104372.000000 | 185.000000 | 26.500000 | 100.000000 | 6.000000 | 73.000000 | ... | 3.000000 | 4.000000 | 4.000000 | 76.000000 | 45.000000 | 11.000000 | 4.000000 | 8.000000 | 42.000000 | 65.000000 |
| 75% | 2017.000000 | 105238.000000 | 190.000000 | 29.100000 | 105376.000000 | 190.000000 | 29.400000 | 130.000000 | 9.000000 | 94.000000 | ... | 5.000000 | 7.000000 | 7.000000 | 97.000000 | 59.000000 | 15.000000 | 7.000000 | 11.000000 | 78.000000 | 104.000000 |
| max | 2023.000000 | 209950.000000 | 211.000000 | 42.300000 | 211663.000000 | 211.000000 | 44.000000 | 1146.000000 | 113.000000 | 491.000000 | ... | 24.000000 | 30.000000 | 103.000000 | 489.000000 | 328.000000 | 91.000000 | 27.000000 | 38.000000 | 1890.000000 | 2146.000000 |
8 rows × 22 columns
We use the .fillna() method and compare the descriptive statistics with the original data to check for disparities and the effect on the analysis.
# Fill all the NaN values with 0
clean_tennis_data = tennis_data.fillna(0)
# will describe the descriptive statistics of the tennis data after filtering the values
clean_tennis_data.describe()
| Year | Winner ID | Winning Player's Height (in cm) | Winner's Age | Loser ID | Losing Player's Height (in cm) | Loser's Age | Game Duration (minutes) | Winner Aces | Winner Service Points | ... | Winner Break Points Saved | Winner Break Points Faced | Loser Aces | Loser Service Points | Loser First Serves In | Loser Service Games Played | Loser Break Points Saved | Loser Break Points Faced | Winning Player's Rank | Losing Player's Rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | ... | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.000000 | 67708.00000 | 67708.000000 | 67708.000000 |
| mean | 2011.380250 | 108946.000177 | 182.025462 | 26.345362 | 109010.336873 | 177.241065 | 26.447461 | 95.102794 | 6.313449 | 71.077849 | ... | 3.147265 | 4.575855 | 4.662580 | 73.877695 | 44.377297 | 11.223947 | 4.346768 | 7.84634 | 78.764326 | 115.634091 |
| std | 6.541564 | 18394.042835 | 28.380673 | 3.976588 | 18438.182738 | 39.210845 | 4.096072 | 51.408822 | 5.654492 | 35.589375 | ... | 3.090682 | 4.098286 | 4.897643 | 36.134858 | 22.990836 | 5.339438 | 3.398067 | 4.64402 | 138.214358 | 185.437718 |
| min | 2001.000000 | 100644.000000 | 0.000000 | 0.000000 | 100644.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 25% | 2006.000000 | 103656.000000 | 183.000000 | 23.400000 | 103580.000000 | 180.000000 | 23.500000 | 69.000000 | 2.000000 | 52.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 55.000000 | 32.000000 | 9.000000 | 2.000000 | 5.00000 | 17.000000 | 34.000000 |
| 50% | 2011.000000 | 104468.000000 | 185.000000 | 26.100000 | 104468.000000 | 185.000000 | 26.300000 | 93.000000 | 5.000000 | 69.000000 | ... | 2.000000 | 4.000000 | 3.000000 | 72.000000 | 43.000000 | 11.000000 | 4.000000 | 8.00000 | 44.000000 | 67.000000 |
| 75% | 2017.000000 | 105332.000000 | 190.000000 | 29.000000 | 105453.000000 | 190.000000 | 29.300000 | 125.000000 | 9.000000 | 92.000000 | ... | 5.000000 | 7.000000 | 7.000000 | 95.000000 | 57.000000 | 14.000000 | 6.000000 | 11.00000 | 85.000000 | 113.000000 |
| max | 2023.000000 | 211468.000000 | 211.000000 | 42.300000 | 212041.000000 | 211.000000 | 46.000000 | 1146.000000 | 113.000000 | 491.000000 | ... | 24.000000 | 30.000000 | 103.000000 | 489.000000 | 328.000000 | 91.000000 | 27.000000 | 38.00000 | 2101.000000 | 2159.000000 |
8 rows × 22 columns
Upon experimenting with both methods, and based on the heatmap visualization, imputing values results in descriptive statistics nearly identical to those of the original dataset. This suggests that imputation is the preferable option for preparing the data, ensuring the most accurate analysis possible.
print(clean_tennis_data.columns.tolist())
['Year', 'Tournament ID', 'Tournament Name', 'Type of Surface', 'Level of Tournament', 'Winner ID', 'Name of Winner', "Winning Player's Hand", "Winning Player's Height (in cm)", "Winning Player's Country", "Winner's Age", 'Loser ID', 'Name of Losing Player', "Losing Player's Hand", "Losing Player's Height (in cm)", "Losing Player's Country", "Loser's Age", 'Game Duration (minutes)', 'Winner Aces', 'Winner Service Points', 'Winner First Serves In', 'Winner Service Games Played', 'Winner Break Points Saved', 'Winner Break Points Faced', 'Loser Aces', 'Loser Service Points', 'Loser First Serves In', 'Loser Service Games Played', 'Loser Break Points Faced', 'Loser Break Points Saved', "Winning Player's Rank", "Losing Player's Rank"]
- value_counts(): returns an object containing counts of unique values, which lets us count the unique elements in a column of a pandas DataFrame.
- sort_values(): sorts in ascending order by default; passing ascending=False returns the values in descending order instead.
- isin(): checks whether the DataFrame contains the specified value(s). It returns a DataFrame shaped like the original, with each value replaced by True if it was one of the specified values and False otherwise.
- groupby(): groups the data points by the distinct values in the given column or columns, so that we can calculate aggregated values for the resulting groups.
- merge(): combines the contents of two DataFrames by merging them together.

To carry out the analysis for this question, we first need to identify the top players. Our team decided to treat the five players with the most wins in the dataset as the top players for this analysis. To get the number of wins each top player has on each type of surface, we filter the data, select the needed columns, and count the filtered rows.
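As a minimal, self-contained sketch of how these methods compose, consider a made-up two-player table (player names 'A' and 'B' are placeholders, not ATP data):

```python
import pandas as pd

# Toy mini dataset (illustrative only -- not the ATP data)
matches = pd.DataFrame({
    'Name of Winner': ['A', 'A', 'B', 'A', 'B'],
    'Type of Surface': ['Hard', 'Clay', 'Hard', 'Hard', 'Grass'],
})

# value_counts + sort_values: wins per player, most wins first
wins = matches['Name of Winner'].value_counts().sort_values(ascending=False)

# isin: keep only the rows for a chosen set of players
top = matches[matches['Name of Winner'].isin(['A'])]

# groupby: wins per player per surface
per_surface = (matches.groupby(['Name of Winner', 'Type of Surface'])
                      .size().reset_index(name='Wins'))

# merge: join the total win counts back onto the per-surface counts
totals = wins.reset_index()
totals.columns = ['Name of Winner', 'Total_Wins']  # normalize names across pandas versions
merged = per_surface.merge(totals, on='Name of Winner')
print(merged)
```

The real analysis below applies exactly this pipeline to clean_tennis_data.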
We use .value_counts() to count the filtered data and .sort_values(ascending=False) to put the list in descending order.
sorted_win_count = clean_tennis_data['Name of Winner'].value_counts().sort_values(ascending=False)
sorted_win_count.head(10)
Roger Federer      1214
Novak Djokovic     1083
Rafael Nadal       1079
David Ferrer        740
Andy Murray         738
Tomas Berdych       643
Andy Roddick        608
Richard Gasquet     605
Marin Cilic         586
Stan Wawrinka       566
Name: Name of Winner, dtype: int64
# Add columns specifying the type of surface and hand - either left or right.
# Filtering data for top players
top_players = clean_tennis_data['Name of Winner'].value_counts().head(5).index.tolist()
top_players_data = clean_tennis_data[clean_tennis_data['Name of Winner'].isin(top_players)]
# Analyzing performance on different surfaces
winner_surface_performance = top_players_data.groupby(["Winner ID",'Name of Winner', 'Type of Surface',"Winning Player's Hand"]).agg(
Total_Wins=pd.NamedAgg(column='Name of Winner', aggfunc='count'),
)
# Resetting index for better readability
winner_surface_performance.reset_index(inplace=True)
top_winners_data = pd.DataFrame(winner_surface_performance)
After some data processing, we get the top five players' numbers of wins on each type of surface. We can see that all of them win the most games on the Hard surface. We think the result looks this way because most matches are played on hard courts, which does not necessarily mean the players perform better on that surface. So we should conduct further analysis, such as finding their number of losses on each type of surface and their win rates. To achieve this, we count how many times each player appears in the Name of Losing Player column and calculate the win rate.
loser_surface_performance = top_players_data.groupby(["Name of Losing Player", "Type of Surface","Losing Player's Hand"]).agg(
Total_Lost=pd.NamedAgg(column="Name of Losing Player", aggfunc='count'),
)
loser_surface_performance.reset_index(inplace=True)
lost_count = pd.DataFrame(loser_surface_performance)
lost_count
# Merging the winner and loser dataframe by names and type of surfaces
win_percentage = winner_surface_performance.merge(lost_count, left_on=['Name of Winner', 'Type of Surface'],
right_on=['Name of Losing Player', 'Type of Surface'])
# Drop the duplicate columns
win_percentage.drop(columns=["Name of Losing Player","Losing Player's Hand"], inplace = True)
# Find the total number of game played on each surface
win_percentage['Total_Game_Played'] = win_percentage['Total_Wins'] + win_percentage['Total_Lost']
# Calculate 'Win_Percentage' column by dividing 'Total_Wins' by 'Total_Game_Played' and multiplying by 100
win_percentage['Win_Percentage'] = win_percentage['Total_Wins']/win_percentage['Total_Game_Played']*100
win_percentage
| | Winner ID | Name of Winner | Type of Surface | Winning Player's Hand | Total_Wins | Total_Lost | Total_Game_Played | Win_Percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | 103819 | Roger Federer | Clay | R | 224 | 18 | 242 | 92.561983 |
| 1 | 103819 | Roger Federer | Grass | R | 192 | 5 | 197 | 97.461929 |
| 2 | 103819 | Roger Federer | Hard | R | 761 | 40 | 801 | 95.006242 |
| 3 | 103970 | David Ferrer | Carpet | R | 9 | 1 | 10 | 90.000000 |
| 4 | 103970 | David Ferrer | Clay | R | 337 | 29 | 366 | 92.076503 |
| 5 | 103970 | David Ferrer | Grass | R | 44 | 1 | 45 | 97.777778 |
| 6 | 103970 | David Ferrer | Hard | R | 350 | 43 | 393 | 89.058524 |
| 7 | 104745 | Rafael Nadal | Clay | L | 478 | 14 | 492 | 97.154472 |
| 8 | 104745 | Rafael Nadal | Grass | L | 76 | 5 | 81 | 93.827160 |
| 9 | 104745 | Rafael Nadal | Hard | L | 523 | 42 | 565 | 92.566372 |
| 10 | 104918 | Andy Murray | Clay | R | 110 | 17 | 127 | 86.614173 |
| 11 | 104918 | Andy Murray | Grass | R | 119 | 5 | 124 | 95.967742 |
| 12 | 104918 | Andy Murray | Hard | R | 501 | 41 | 542 | 92.435424 |
| 13 | 104925 | Novak Djokovic | Clay | R | 275 | 28 | 303 | 90.759076 |
| 14 | 104925 | Novak Djokovic | Grass | R | 115 | 5 | 120 | 95.833333 |
| 15 | 104925 | Novak Djokovic | Hard | R | 684 | 35 | 719 | 95.132128 |
# Filtering matches where one player is right-handed and the other is left-handed
matches_rh_vs_lh_specific = tennis_data[((tennis_data['Winning Player\'s Hand'] == 'R') & (tennis_data['Losing Player\'s Hand'] == 'L')) |
((tennis_data['Winning Player\'s Hand'] == 'L') & (tennis_data['Losing Player\'s Hand'] == 'R'))]
# Group by the handedness of the winner and the type of surface, then count the wins
victories_by_hand_surface_specific = matches_rh_vs_lh_specific.groupby(['Winning Player\'s Hand', 'Type of Surface']).size().reset_index(name='Wins')
# Counting the total number of such matches played on each surface
total_matches_by_surface_specific = matches_rh_vs_lh_specific.groupby('Type of Surface').size().reset_index(name='Total Matches')
# Merging the wins with the total matches to calculate win percentages
victory_percentage_specific = pd.merge(victories_by_hand_surface_specific, total_matches_by_surface_specific, on='Type of Surface')
victory_percentage_specific['Win Percentage'] = (victory_percentage_specific['Wins'] / victory_percentage_specific['Total Matches']) * 100
# Display the DataFrame
victory_percentage_specific[['Type of Surface', 'Winning Player\'s Hand', 'Win Percentage']]
| | Type of Surface | Winning Player's Hand | Win Percentage |
|---|---|---|---|
| 0 | Carpet | L | 44.186047 |
| 1 | Carpet | R | 55.813953 |
| 2 | Clay | L | 50.030272 |
| 3 | Clay | R | 49.969728 |
| 4 | Grass | L | 46.646341 |
| 5 | Grass | R | 53.353659 |
| 6 | Hard | L | 46.372599 |
| 7 | Hard | R | 53.627401 |
# Build a grouped bar chart from the surface-performance dataframe
fig = px.bar(winner_surface_performance,
             x='Name of Winner',
             y='Total_Wins',
             # Group the bars by surface type
             color='Type of Surface',
             barmode='group',
             # Rename the axis labels shown on the chart
             labels={'Name of Winner': 'Player Name', 'Total_Wins': 'Total Wins'},
             title='Performance of Top Tennis Players on Different Surfaces')
# Show the plot
fig.show()
Based on the visualization and dataframe, Rafael Nadal dominates on the Clay surface with a 97% win rate over 492 games, an overwhelming superiority over the other players on clay. Andy Murray has the lowest clay win rate, over 127 games, which may indicate that he does not perform as well on that surface.
Based on the visualization and dataframe, David Ferrer and Roger Federer have the highest win rates on Grass. Ferrer's is 97.78% over 45 games and Federer's 97.46% over 197 games, making them the most dominant players on grass. However, Federer's sample size is much larger, so his win rate is more accurate and more stable; we therefore consider him the dominant player on grass. Rafael Nadal has the lowest grass win rate, 93.83% over 81 games; although still a very solid rate, it places him last among the top five on grass.
Based on the visualization and dataframe, Novak Djokovic and Roger Federer have the highest win rates on the Hard surface: Djokovic's is 95.13% over 719 games and Federer's 95.01% over 801 games, making them the most dominant players on hard courts. David Ferrer has the lowest, 89.06% over 393 games, indicating he does not perform as well on hard courts.
Based on the visualization, Andy Murray performs much better on grass, with a 95.97% win rate, suggesting the surface suits his game best. His worst surface is clay, where his win rate is only 86.61%.
Based on the visualization, David Ferrer performs best on grass, with a 97.78% win rate. He is weakest on hard courts, where his 89.06% win rate supports that claim.
Based on the visualization, Novak Djokovic also performs best on grass, with a 95.83% win rate. On clay his win rate drops to 90.76%, his weakest surface.
Based on the visualization, Rafael Nadal performs far better on clay, with a 97.15% win rate. His lowest win rate, 92.57%, comes on hard courts.
Based on the visualization, Roger Federer performs best on grass, with a 97.46% win rate; he rarely loses there. His lowest win rate, 92.56%, comes on clay.
Players do not do as well on carpet surfaces, partly because our data is old: tennis is no longer played on carpet. Hard surfaces show the highest win counts because most matches are played on hard courts. These findings can help sponsorship brands like Nike and Under Armour partner with a player and design equipment to improve their game. We would recommend that Nike work with Roger Federer on a product, such as sneakers, that helps him play better on clay, and that Under Armour partner with David Ferrer on a product that improves his hard-court performance.
How significantly does a player's age impact their game performance, and can we associate the success of younger players and the difficulties faced by older players with factors like stamina and match duration?
#group the players by age, and then aggregate the average game duration and win counts, for use in visualization
group_by_age = clean_tennis_data.groupby("Winner's Age").agg(
avg_dur=('Game Duration (minutes)', 'mean'),
win_cnt=('Winner ID', 'count')).reset_index()
#create the scatter plot of age vs game duration
fig, ax1 = plt.subplots(figsize=(12, 6))
sns.scatterplot(x="Winner's Age", y='avg_dur', data=group_by_age, ax=ax1, color='blue', label='Average Game Duration')
#basic formatting of chart for title, axis labels, and upper/lower bounds of chart
ax1.set_xlabel('Age of Winner')
ax1.set_xlim(left=12, right=42)
ax1.set_ylabel('Avg Game Duration (minutes)', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_title('Age vs Game Duration and Number of Wins')
#create a secondary axis for number of wins, add line plot, and labels
ax2 = ax1.twinx()
sns.lineplot(x="Winner's Age", y='win_cnt', data=group_by_age, ax=ax2, color='orange', label='Number of Wins')
ax2.set_ylabel('Number of Wins', color='orange')
ax2.tick_params(axis='y', labelcolor='orange')
#place legends in the upper corners for easier viewing
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
This information will help sports entertainment broadcasters estimate how long a game might take depending on the players' ages. It makes casting and streaming more precise: broadcasters can allocate a specific amount of time for the game and decide how many ads to stream during the match and when. For example, if the players in a matchup are between 40 and 44 years old, the match is expected to take about 111.178 minutes, roughly 1 hour and 52 minutes. ESPN and BeIN Sports will know how long the match should take and can plan their subsequent programming and the day's lineup accordingly; the longer the game runs, the less time remains for other broadcasts.
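One way broadcasters could derive such planning figures is to bucket winners' ages into bands and average the game duration per band. A hedged sketch on toy rows (the 5-year band width is an assumption, and `df` stands in for `clean_tennis_data`):

```python
import pandas as pd

# Toy rows standing in for clean_tennis_data
df = pd.DataFrame({
    "Winner's Age": [22.0, 27.5, 31.0, 41.0, 43.0],
    'Game Duration (minutes)': [85, 95, 100, 110, 112],
})

# 5-year age bands: (15, 20], (20, 25], ..., (40, 45]
df['Age Band'] = pd.cut(df["Winner's Age"], bins=list(range(15, 50, 5)))
avg_by_band = df.groupby('Age Band', observed=True)['Game Duration (minutes)'].mean()
print(avg_by_band)
```

Each band's mean is the expected slot length a scheduler would budget for matches in that age range.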
During a clash of the legends, i.e. Djokovic vs. Nadal, Djokovic vs. Federer, or Federer vs. Nadal, which player won the majority of the time against his opponent, and what was the reason behind it? Is the court/season, hand, age, or height the reason for the winner's success?
Analysis based on tournaments
def clash(players):
clash_data = clean_tennis_data[clean_tennis_data['Name of Winner'].isin(players) & clean_tennis_data['Name of Losing Player'].isin(players)]
winners = clash_data['Name of Winner'].value_counts()
tournament_winners_count = clash_data[['Tournament Name','Name of Winner']].value_counts().reset_index(name='Count')
return clash_data, winners,tournament_winners_count
players = ['Novak Djokovic', 'Rafael Nadal', 'Roger Federer']
clash_data, winners, tournament_winners_count = clash(players)
winners
Novak Djokovic    58
Rafael Nadal      53
Roger Federer     40
Name: Name of Winner, dtype: int64
tournament_winners_count
| | Tournament Name | Name of Winner | Count |
|---|---|---|---|
| 0 | Roland Garros | Rafael Nadal | 14 |
| 1 | Rome Masters | Rafael Nadal | 8 |
| 2 | Tour Finals | Novak Djokovic | 7 |
| 3 | Indian Wells Masters | Novak Djokovic | 6 |
| 4 | Rome Masters | Novak Djokovic | 6 |
| 5 | Australian Open | Novak Djokovic | 6 |
| 6 | Wimbledon | Novak Djokovic | 5 |
| 7 | Tour Finals | Roger Federer | 5 |
| 8 | Monte Carlo Masters | Rafael Nadal | 5 |
| 9 | Madrid Masters | Rafael Nadal | 4 |
| 10 | US Open | Novak Djokovic | 4 |
| 11 | Miami Masters | Novak Djokovic | 4 |
| 12 | Wimbledon | Roger Federer | 4 |
| 13 | Cincinnati Masters | Roger Federer | 3 |
| 14 | Shanghai Masters | Roger Federer | 3 |
| 15 | Paris Masters | Novak Djokovic | 3 |
| 16 | Roland Garros | Novak Djokovic | 3 |
| 17 | Cincinnati Masters | Novak Djokovic | 3 |
| 18 | Indian Wells Masters | Roger Federer | 3 |
| 19 | Monte Carlo Masters | Roger Federer | 3 |
| 20 | Dubai | Roger Federer | 3 |
| 21 | US Open | Roger Federer | 3 |
| 22 | Australian Open | Rafael Nadal | 3 |
| 23 | Australian Open | Roger Federer | 2 |
| 24 | Tour Finals | Rafael Nadal | 2 |
| 25 | Miami Masters | Roger Federer | 2 |
| 26 | US Open | Rafael Nadal | 2 |
| 27 | Beijing | Novak Djokovic | 2 |
| 28 | Canada Masters | Novak Djokovic | 2 |
| 29 | Wimbledon | Rafael Nadal | 2 |
| 30 | Basel | Roger Federer | 2 |
| 31 | Monte Carlo Masters | Novak Djokovic | 2 |
| 32 | Miami Masters | Rafael Nadal | 2 |
| 33 | Indian Wells Masters | Rafael Nadal | 2 |
| 34 | Masters Cup | Roger Federer | 2 |
| 35 | Hamburg Masters | Rafael Nadal | 2 |
| 36 | Davis Cup WG R1: ESP vs SRB | Rafael Nadal | 1 |
| 37 | Doha | Novak Djokovic | 1 |
| 38 | Dubai | Novak Djokovic | 1 |
| 39 | Dubai | Rafael Nadal | 1 |
| 40 | Basel | Novak Djokovic | 1 |
| 41 | Hamburg Masters | Roger Federer | 1 |
| 42 | Cincinnati Masters | Rafael Nadal | 1 |
| 43 | Madrid Masters | Novak Djokovic | 1 |
| 44 | Davis Cup WG PO: SUI vs SCG | Roger Federer | 1 |
| 45 | Beijing Olympics | Rafael Nadal | 1 |
| 46 | Roland Garros | Roger Federer | 1 |
| 47 | Madrid Masters | Roger Federer | 1 |
| 48 | Masters Cup | Rafael Nadal | 1 |
| 49 | Queen's Club | Rafael Nadal | 1 |
| 50 | Canada Masters | Roger Federer | 1 |
| 51 | Canada Masters | Rafael Nadal | 1 |
| 52 | Atp Cup | Novak Djokovic | 1 |
# Map each player to a fixed color; the keys must match values in 'Name of Winner' for color_discrete_map to take effect
custom_colors = {'Novak Djokovic': 'blue', 'Rafael Nadal': 'yellow', 'Roger Federer': 'black'}
fig = px.bar(tournament_winners_count.reset_index(), x='Tournament Name', y='Count',
labels={'Count': 'Number of Winners'},
color='Name of Winner',
title='Number of Winners per Tournament',
hover_name='Count',
color_discrete_map = custom_colors)
fig.update_layout(
xaxis=dict(tickangle=-90, showgrid=False),
yaxis=dict(showgrid=False),
height=500,
width=800
)
fig.show()
Analysis Based on Surface
def get_player_name(player_id, data):
if player_id in data['Winner ID'].values:
return data[data['Winner ID'] == player_id]['Name of Winner'].iloc[0]
elif player_id in data['Loser ID'].values:
return data[data['Loser ID'] == player_id]['Name of Losing Player'].iloc[0]
else:
return "Unknown Player"
def create_player_profile(player_id, data, surface=None):
player_name = get_player_name(player_id, data)
if surface and surface.strip():
player_matches = data[((data['Winner ID'] == player_id) | (data['Loser ID'] == player_id)) & (data['Type of Surface'] == surface)]
else:
player_matches = data[(data['Winner ID'] == player_id) | (data['Loser ID'] == player_id)]
total_matches = len(player_matches)
total_wins = len(player_matches[player_matches['Winner ID'] == player_id])
average_aces = (player_matches[(player_matches['Winner ID'] == player_id)]['Winner Aces'].mean() +
player_matches[(player_matches['Loser ID'] == player_id)]['Loser Aces'].mean())
total_bp_saved = player_matches[(player_matches['Winner ID'] == player_id)]['Winner Break Points Saved'].mean() + player_matches[(player_matches['Loser ID'] == player_id)]['Loser Break Points Saved'].mean()
total_bp_faced = player_matches[(player_matches['Winner ID'] == player_id)]['Winner Break Points Faced'].mean() + player_matches[(player_matches['Loser ID'] == player_id)]['Loser Break Points Faced'].mean()
profile = {
'Player ID': player_id,
'Player Name': player_name,
'Total Matches': total_matches,
'Total Wins': total_wins,
'Average Aces per Match': average_aces,
'BP Saved': total_bp_saved,
'BP Faced': total_bp_faced
}
return profile
def compare_players(player_ids, data, surface=None):
profiles = [create_player_profile(player_id, data, surface) for player_id in player_ids]
# Prepare data for the bar chart
categories = ['Total Matches', 'Total Wins', 'Average Aces per Match', 'BP Saved', 'BP Faced']
traces = []
colors = ['violet', 'orange', 'blue']
for i, player_id in enumerate(player_ids):
player_values = [profiles[i][cat] for cat in categories]
trace = go.Bar(x=categories, y=player_values, name=f'{profiles[i]["Player Name"]} (ID: {player_id})', marker_color=colors[i])
traces.append(trace)
# Create the figure
fig = go.Figure(data=traces)
fig.update_layout(title_text='Head-to-Head comparison of players', barmode='group')
return fig
def filter_head_to_head_matches(player_ids, data, surface=None):
if surface:
surface = surface.lower()
head_to_head = data[
(((data['Winner ID'] == player_ids[0]) & (data['Loser ID'].isin(player_ids[1:])) |
(data['Winner ID'] == player_ids[1]) & (data['Loser ID'].isin([player_ids[0], player_ids[2]])) |
(data['Winner ID'] == player_ids[2]) & (data['Loser ID'].isin([player_ids[0], player_ids[1]]))) &
(data['Type of Surface'].str.lower() == surface))
]
else:
head_to_head = data[
(((data['Winner ID'] == player_ids[0]) & (data['Loser ID'].isin(player_ids[1:])) |
(data['Winner ID'] == player_ids[1]) & (data['Loser ID'].isin([player_ids[0], player_ids[2]])) |
(data['Winner ID'] == player_ids[2]) & (data['Loser ID'].isin([player_ids[0], player_ids[1]]))))
]
return head_to_head
def create_player_head_to_head_profile(player_id, head_to_head_matches):
total_matches = len(head_to_head_matches)
total_wins = len(head_to_head_matches[head_to_head_matches['Winner ID'] == player_id])
average_aces = head_to_head_matches[head_to_head_matches['Winner ID'] == player_id]['Winner Aces'].mean()
win_bp_saved = head_to_head_matches[head_to_head_matches['Winner ID'] == player_id]['Winner Break Points Saved'].sum()
win_bp_faced = head_to_head_matches[head_to_head_matches['Winner ID'] == player_id]['Winner Break Points Faced'].sum()
loser_bp_saved = head_to_head_matches[head_to_head_matches['Loser ID'] == player_id]['Loser Break Points Saved'].sum()
loser_bp_faced = head_to_head_matches[head_to_head_matches['Loser ID'] == player_id]['Loser Break Points Faced'].sum()
profile = {
'Player ID': player_id,
'Total Matches': total_matches,
'Total Wins': total_wins,
'Average Aces per Match': average_aces,
'Winner BP Saved': win_bp_saved,
'Winner BP Faced': win_bp_faced,
'Loser BP Saved': loser_bp_saved,
'Loser BP Faced': loser_bp_faced,
}
return profile
# Example usage with user input for three players
surface = input("Enter the surface type (Clay, Grass, Hard) or leave blank for all surfaces: ").strip()
player1_id = 103819 # Replace with the actual ID of the first player
player2_id = 104745 # Replace with the actual ID of the second player
player3_id = 104925 # Replace with the actual ID of the third player
player_ids = [player1_id, player2_id, player3_id]
# Filter for head-to-head matches on the specified surface
head_to_head_matches = filter_head_to_head_matches(player_ids, clean_tennis_data, surface)
# Create profiles based on head-to-head matches
player1_profile = create_player_head_to_head_profile(player1_id, head_to_head_matches)
player2_profile = create_player_head_to_head_profile(player2_id, head_to_head_matches)
player3_profile = create_player_head_to_head_profile(player3_id, head_to_head_matches)
# Compare and visualize the profiles
fig = compare_players(player_ids, head_to_head_matches, surface)
fig.show()
Based on the graph, we can infer that companies should choose a player as the face of their brand based on the number of wins that player has in the geographic location where the company has its majority audience, rather than the player who has won the most matches overall.
For client Lacoste: Lacoste would benefit from choosing Nadal as its brand ambassador for France, even though he has lost more matches than the other athletes. Nadal's impressive record at French tournaments like Roland Garros and the Monte Carlo Masters makes him a strong choice for reaching the French audience.
For client Sergio Tacchini: Sergio Tacchini would benefit from choosing Nadal as its brand ambassador for Italy, even though he has lost more matches than the other athletes. Nadal has an impressive record at Italian tournaments like the Rome Masters, held in Rome; also, at the Tour Finals held in Turin, Nadal is the only player to have won.
For clients Under Armour and Nike: Under Armour and Nike would benefit from choosing Novak Djokovic as their brand ambassador for the USA. Djokovic's impressive record at American tournaments such as the Indian Wells Masters, Miami Masters, and Cincinnati Masters makes him a strong candidate.
This suggests that companies should prioritize local relevance and cultural connection over global dominance when choosing brand ambassadors. The graph shows there is no single dominant winner across all events: different tournaments and locations have different winners, so assigning one player to every company would be a mistake. Instead, each company would do better to choose the winner from its own geographic market as brand ambassador, which should attract a larger local audience than a player who has won more overall but has not been successful in that region.
# Add a column to the data to better visualize the data
clean_tennis_data['Point Color'] = clean_tennis_data.apply(lambda row: 'blue' if row['Winner Aces'] > row['Loser Aces'] else 'orange', axis=1)
# Create a jointplot with scatter plot and area chart
jointplot = sns.jointplot(data=clean_tennis_data, x='Winner Aces', y='Loser Aces', hue='Point Color',
palette={'blue': 'blue', 'orange': 'orange'}, kind="scatter",
xlim=(0, 60), ylim=(0, 60), height=8, ratio=4, marginal_ticks=True, legend=False)
# Add a trend line to the scatter plot
sns.regplot(data=clean_tennis_data, x='Winner Aces', y='Loser Aces', scatter=False, ax=jointplot.ax_joint, color='#000000')
# Format the chart
plt.suptitle('Winner vs Loser Aces in Tennis Matches', y=1.02)
plt.xlabel('Winner Aces')
plt.ylabel('Loser Aces')
plt.show()
The main motive of this analysis is to determine whether the number of aces a player delivers during a match accounts for that player's victory. The relationship is shown using seaborn's jointplot function, combining a scatter plot with marginal histograms. Points are color-coded by which player hit more aces: blue points indicate the winner hit more aces, and orange points the opposite scenario.
Of the two histograms, the top one shows winner aces while the right one shows loser aces. The trend line indicates a positive relationship between winner and loser aces. The predominance of blue points suggests a consistent pattern in which the winner of the game served more aces than the loser, while the orange points show that matches where the loser hit more aces are a clear minority. In summary, aces play a crucial role in tennis matches and can confer strategic dominance over an opponent: the winner tends to have more aces than the loser, so in most cases, more aces = winner.
This information holds value for betting companies because it lets bettors choose a player to back based on the number of aces being served in the match. Betting companies now offer real-time bets as well, and bettors can keep this factor in mind when placing one. When DraftKings Sportsbook shows the data we provide, bettors can see that the winner of a given matchup has historically served the most aces, which helps them bet on both the number of aces and the winner. The analysis is also valuable to our sports-broadcaster clients, who can use it for commentary and to compare particular players' tendencies to serve aces.
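The color-coding above can be reduced to a single number: the share of matches in which the winner out-aced the loser. A minimal sketch on toy rows (run the same two lines against `clean_tennis_data` for the real figure):

```python
import pandas as pd

# Toy rows standing in for clean_tennis_data
df = pd.DataFrame({
    'Winner Aces': [10, 3, 7, 12, 1],
    'Loser Aces':  [4, 5, 2, 6, 1],
})

# Percentage of matches in which the winner served strictly more aces
winner_more = (df['Winner Aces'] > df['Loser Aces']).mean() * 100
print(f"Winner had more aces in {winner_more:.0f}% of matches")  # 60% on these toy rows
```

The higher this share on the real data, the stronger the "more aces = winner" reading.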
The cells below use machine learning to predict the winner between two players from the clean tennis dataset.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
LabelEncoder converts player names to numerical codes, making them suitable for machine learning. The surface column is one-hot encoded to handle the categorical match-surface type.
# Build a list of unique player names
all_player_names = pd.concat([clean_tennis_data['Name of Winner'], clean_tennis_data['Name of Losing Player']]).unique()
label_encoder = LabelEncoder()
label_encoder.fit(all_player_names)
clean_tennis_data['winner_encoded'] = label_encoder.transform(clean_tennis_data['Name of Winner'])
clean_tennis_data['loser_encoded'] = label_encoder.transform(clean_tennis_data['Name of Losing Player'])
# One-hot encode the surface type
surface_encoded = pd.get_dummies(clean_tennis_data['Type of Surface'], prefix='Surface')
clean_tennis_data = pd.concat([clean_tennis_data, surface_encoded], axis=1)
# Add Aces and Break Points Saved features
clean_tennis_data['total_aces'] = clean_tennis_data['Winner Aces'] + clean_tennis_data['Loser Aces']
clean_tennis_data['total_bp_saved'] = clean_tennis_data['Winner Break Points Saved'] + clean_tennis_data['Loser Break Points Saved']
The features fed to the RandomForestClassifier include the encoded names of the winning and losing players, the total break points saved by winners and losers, and the total aces for winners and losers, along with the one-hot surface columns.
# Prepare features and target
features = clean_tennis_data[['winner_encoded', 'loser_encoded', 'total_aces', 'total_bp_saved'] + list(surface_encoded.columns)]
# Create a binary target variable (1 if the winner's encoded ID is smaller than the loser's).
# Note: this target is derived from winner_encoded and loser_encoded, which are also
# features, so the model can recover it almost perfectly.
clean_tennis_data['target'] = clean_tennis_data.apply(lambda x: 1 if x['winner_encoded'] < x['loser_encoded'] else 0, axis=1)
# Splitting the data for training and testing.
X_train, X_test, y_train, y_test = train_test_split(features, clean_tennis_data['target'], test_size=0.2, random_state=42)
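Because the target above is computed directly from winner_encoded and loser_encoded, which also appear in the feature matrix, near-perfect accuracy is expected. A common leakage-free alternative is to randomize which player is listed first and predict whether that first player won. A minimal sketch on toy data (column values are hypothetical, not drawn from the real match table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the encoded match table
toy = pd.DataFrame({'winner_encoded': [3, 1, 4, 2, 7, 0],
                    'loser_encoded':  [0, 5, 2, 6, 1, 9]})

# Randomly decide, per match, whether the winner appears as player 1 or player 2
swap = rng.random(len(toy)) < 0.5
toy['player1'] = np.where(swap, toy['loser_encoded'], toy['winner_encoded'])
toy['player2'] = np.where(swap, toy['winner_encoded'], toy['loser_encoded'])

# target = 1 means "player 1 won"; it no longer mirrors the raw encodings
toy['target'] = (~swap).astype(int)
```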
A RandomForestClassifier is used, as it is a popular and robust classification model.
# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
RandomForestClassifier(random_state=42)
# Evaluating the model accuracy
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
def predict_match_outcome(player1, player2, surface, model, label_encoder, df, feature_columns):
    # Encode both player names with the same encoder used during training
    player1_encoded = label_encoder.transform([player1])[0]
    player2_encoded = label_encoder.transform([player2])[0]
    # One-hot vector for the surface; dummy columns are named like 'Surface_Grass',
    # so compare case-insensitively against the requested surface
    surface_columns = [s for s in feature_columns if s.startswith('Surface_')]
    surface_vector = [1 if s.lower() == f'surface_{surface}'.lower() else 0 for s in surface_columns]
    # Placeholders for the remaining numeric features (total_aces, total_bp_saved)
    num_additional_features = len(feature_columns) - 2 - len(surface_vector)
    additional_features = [0] * num_additional_features
    # Assemble the feature vector in the same column order used for training:
    # players first, then the numeric totals, then the surface dummies
    features = [player1_encoded, player2_encoded] + additional_features + surface_vector
    # Make sure the number of features matches what the model was trained on
    assert len(features) == len(feature_columns), "Feature vector does not match the expected length"
    # Predict using a DataFrame so the column names match the fitted model
    prediction = model.predict(pd.DataFrame([features], columns=feature_columns))[0]
    # Interpret the result: prediction == 1 is read as a win for player1
    return player1 if prediction == 1 else player2
Accuracy: 0.995938561512332
predict_match_outcome predicts the winner between two players. As an example, we use Andy Murray against Daniil Medvedev on a grass surface.
feature_columns = X_train.columns  # the columns of the training feature DataFrame
predicted_winner = predict_match_outcome('Andy Murray', 'Daniil Medvedev', 'grass', model, label_encoder, clean_tennis_data, feature_columns)
print("Predicted Winner:", predicted_winner)
Predicted Winner: Andy Murray
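Beyond a hard win/lose call, RandomForestClassifier also exposes predict_proba, which returns class probabilities; figures like these could back the win-probability numbers our betting clients need. A self-contained sketch on synthetic data (toy features, not the real match table; the classifier is named clf to avoid clashing with the model above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: 4 features; class 1 whenever feature 0 exceeds feature 1
X = rng.random((200, 4))
y = (X[:, 0] > X[:, 1]).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Probability that "player 1 wins" (class 1) for one hypothetical match
proba = clf.predict_proba([[0.9, 0.1, 0.5, 0.5]])[0]
print(f"P(player 1 wins) = {proba[1]:.2f}")
```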
Summary of Findings:
Our main objective as an analytics company is to empower our clients by delivering actionable insights, promoting wise decision-making, and ultimately helping their businesses become more profitable. For DraftKings Sportsbook: Based on the analysis, we can suggest which players DraftKings should recommend its clients bet on; the resulting client profits will support the growth of the company.
For Nike, Under Armour, Lacoste, and Sergio Tacchini: Our analysis will help increase the sales and reach of these brands in their regions, fostering the business development and growth of our clients.
For broadcasting companies: ESPN and BeIN Sports need their content to reach a large number of viewers, and our insights will help them reach a wider audience.